Skip to content

Fix silent drop of return payloads under master backpressure#69197

Open
dwoz wants to merge 1 commit into
3008.xfrom
transportfix
Open

Fix silent drop of return payloads under master backpressure#69197
dwoz wants to merge 1 commit into
3008.xfrom
transportfix

Conversation

@dwoz
Copy link
Copy Markdown
Contributor

@dwoz dwoz commented May 21, 2026

When RequestClient._send_recv pulled a (future, message) pair off the queue whose future had already been marked done — i.e. the per-message timeout fired while the entry waited behind a slow master — the pre-send if future.done(): continue short-circuit discarded the message without ever sending it. The minion's caller saw the same SaltReqTimeoutError it sees in 3007.x, but the master never received the frame, so _return / event_return / external job cache side effects never fired and result stores were left silently incomplete. The reproducer in TRANSPORT_BUG.md shows ~4x delivery loss in 3008.0rc3 vs 3007.14 under the same workload.

The short-circuit was added during the worker-pool routing rewrite (#68532) as a load-shedding optimization. That intent is sound for genuine request/response traffic, but wrong for fire-and-await-ACK payloads like _return: the master persists the side effects from the message itself, regardless of whether the originating future is still alive. Remove the pre-send check from both _send_recv implementations. The remaining post-send future.done() handling and the recv-loop's elif future.done(): break preserve the back-pressure guarantees (POLLOUT/POLLIN poll budgets, reconnect on stale futures) — same control flow as 3007.x.

Add a regression test that pre-stages 20 stale-future entries on the queue and asserts every one reaches the mocked wire. Fails on the unpatched source with 0/20 sends recorded.

Also fix test_client_timeout_msg to use a guaranteed-unused local port instead of the default 4506 — the test assumes nothing is listening there, which collides with any host running a real salt-master.

When RequestClient._send_recv pulled a (future, message) pair off
the queue whose future had already been marked done — i.e. the
per-message timeout fired while the entry waited behind a slow
master — the pre-send `if future.done(): continue` short-circuit
discarded the message without ever sending it. The minion's caller
saw the same SaltReqTimeoutError it sees in 3007.x, but the master
never received the frame, so _return / event_return / external job
cache side effects never fired and result stores were left silently
incomplete. The reproducer in TRANSPORT_BUG.md shows ~4x delivery
loss in 3008.0rc3 vs 3007.14 under the same workload.

The short-circuit was added during the worker-pool routing rewrite
(#68532) as a load-shedding optimization. That intent is sound for
genuine request/response traffic, but wrong for fire-and-await-ACK
payloads like _return: the master persists the side effects from
the message itself, regardless of whether the originating future
is still alive. Remove the pre-send check from both _send_recv
implementations. The remaining post-send `future.done()` handling
and the recv-loop's `elif future.done(): break` preserve the
back-pressure guarantees (POLLOUT/POLLIN poll budgets, reconnect
on stale futures) — same control flow as 3007.x.

Add a regression test that pre-stages 20 stale-future entries on
the queue and asserts every one reaches the mocked wire. Fails on
the unpatched source with 0/20 sends recorded.

Also fix test_client_timeout_msg to use a guaranteed-unused local
port instead of the default 4506 — the test assumes nothing is
listening there, which collides with any host running a real
salt-master.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:full Run the full test suite

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants